A global analysis of matches and mismatches between human genetic and linguistic histories

Significance
There has been considerable debate about the extent to which our biological and linguistic histories match each other, supported by examples of both matches and mismatches. We introduce a genomic database (GeLaTo, or Genes and Languages Together) to quantify matches and mismatches worldwide. While in most populations genetic and linguistic relations match, mismatches occur regularly as a result of language shift, and several language families follow diversification patterns different from those of the genomes. These findings reveal features of population contact in human history that were previously inaccessible to observation. Our database opens avenues for disentangling demographic and linguistic history and for comparing biological and linguistic modes of evolution.


Dataset description
We introduce GeLaTo (Genes and Languages Together), the database on which this study is based. The panel of populations analyzed is typed on the Human Origins Array (Affymetrix), a set of SNPs selected for population history studies and ascertained against the genomic diversity found in 11 individuals from different continents (1). For the analysis, only autosomal chromosomes are considered, to balance out the female/male ratio per population (593,124 SNPs used). The population samples included come from previously published genetic studies (1-13). We included only populations with a minimum of 5 individuals, for a total of 397 populations and 4030 individuals with a minimum of 550,000 SNPs successfully genotyped. Missing data amount to ~0.1%.
All the genetic populations considered are matched with a unique Glottocode identifier (14), which corresponds to the main language spoken by the population. This information is recovered by screening the original genetic publication, and is derived from direct sampling observation, cultural/linguistic self-identification, or geographical characterization. The proposed Glottocodes are checked by linguists and anthropologists (for a list of people who contributed expertise, see the Acknowledgments section in the main text). Populations who mainly speak a language introduced during the colonial era (widely diffused trans-national languages) are not considered for this analysis, to exclude the wave of historical language shift documented in the past ~2 centuries and keep the results conservative. Linguistic relatedness between populations is defined as speaking languages of the same language family. Language families are therefore considered as the highest level of genealogical relatedness. Language family assignment follows Glottolog groupings. The Glottolog classification is based on conservative methods that reject connections between languages not supported by strong evidence. Further historical and data-driven revisions of Glottolog, in particular for understudied regions and languages, might reveal further relatedness between languages and language families.
Our dataset contains 53 language families and 295 languages with unique Glottocodes. Of the 53 language families, 11 are isolates, i.e., languages with no other known member in their family. This is about half the incidence of isolates in the entire Glottolog database (14), where around 40% of the language families are isolates. The reduced proportion in GeLaTo reflects the lack of genetic coverage in regions with exceptionally high cultural and linguistic diversity, such as the Americas and Papunesia. In some instances, the same language is spoken by more than one genetic population sample. Geographic locations used throughout the analysis correspond to the locations of the genetic populations sampled, manually curated with information from the original publications.
For this study, we use the classic Weir and Cockerham FST measurement of genetic distance between populations (15), calculated with the software PLINK v. 1.9 (16) and a script available at https://github.com/epifaniarango/Fst-for-GeLaTo. Further scripts for assembling GeLaTo, with PLINK commands and data processing and screening in R (17), are available at https://github.com/gelato-org/gelato-data/blob/master/AssembleGeLaTo_2022_MaMi.R. For each population we calculate 1) the median FST between that population and every other population of the dataset, 2) the median FST between that population and the other populations in its macro geographic region, and 3) the median FST between that population and its geographic neighbors within a radius of 1,000 km. Population metadata and genetic profiles are listed in Dataset S1. Pairwise comparisons between populations who live within a radius of 10 km and speak the same language have been discarded, as these would bias the matching patterns. These pairs of populations can be considered as duplicates, but they are in most cases genetically distinct, except for the two Hungarian, two Kove and two Sulka samples, which share an FST = 0 with each other. As they correspond to distinct sampling studies, and they could display slightly different genetic characterizations, we do not merge the pairs into a single population.
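The third summary statistic above (median FST to neighbors within 1,000 km) can be sketched as follows. This is a minimal illustration, not the published R/PLINK pipeline: the haversine great-circle distance, the function names, and the toy populations, coordinates and FST values are all assumptions made for the example.

```python
import math
from statistics import median

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points in decimal degrees."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def median_fst_to_neighbours(target, coords, fst, radius_km=1000.0):
    """Median FST between `target` and all populations within `radius_km`.

    Returns None when no other population lies inside the radius.
    """
    lat0, lon0 = coords[target]
    values = [
        fst[frozenset((target, other))]
        for other, (lat, lon) in coords.items()
        if other != target and haversine_km(lat0, lon0, lat, lon) <= radius_km
    ]
    return median(values) if values else None

# Hypothetical populations on the equator (latitude, longitude) and toy FST values.
coords = {"A": (0.0, 0.0), "B": (0.0, 5.0), "C": (0.0, 8.0), "D": (0.0, 40.0)}
fst = {
    frozenset(("A", "B")): 0.010,
    frozenset(("A", "C")): 0.030,
    frozenset(("A", "D")): 0.120,
    frozenset(("B", "C")): 0.020,
}

print(median_fst_to_neighbours("A", coords, fst))  # neighbours B and C only
print(median_fst_to_neighbours("D", coords, fst))  # no neighbour within 1,000 km
```

Keying the FST table by unordered pairs (`frozenset`) avoids storing each distance twice; the same function gives the regional or global median if the radius filter is replaced by a region label or dropped.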
The divergence time between two populations (in generations ago) is proportional to the FST and the effective population size (Ne) of the ancestral population, following a formula equivalent to Time = 2Ne × linearized FST (18). A variation of this equation implemented in (19) is also considered: the two estimates return corresponding results (Fig. S1A). For the rest of the analysis, the formula from Nei is used. Divergence time in years ago is calculated with a generation time of 29 years. To calculate this approximate divergence time, we need an estimate of the ancestral effective population size Ne before the split. This value is calculated from the present-time Ne of the two populations. Present-time Ne, in turn, is often calculated as proportional to heterozygosity. Sparse SNP data, such as those used in the present study, do not adequately cover the invariant sites of the genome, and therefore cannot yield an absolute heterozygosity value. To overcome this issue, we use an approach based on Identity by Descent blocks, which are shared by individuals as inherited from a common ancestor. From the size and the number of Identity by Descent blocks it is possible to reconstruct the number of shared ancestors and infer variation in Ne through time: this rationale is used by the software IBDNe (20). Identity by Descent blocks are retrieved after phasing the data with Beagle and running refinedIBD and its associated tools for gap removal (21). IBDNe is run over the output of refinedIBD, using the blocks shared within each population. The harmonic mean of all the Ne values from the last 50 generations is used to approximate Ne (this minimizes the effect of an increase or decline in the last 10-20 generations). Populations with an Ne >10,000 are not considered, as such exceptionally high Ne can result from population substructure. Ne values are kept only if the reconstructed variation of Ne is relatively stable in time, without a large increase or decline.
Populations having very large confidence intervals associated with their Ne were then further excluded, leaving 164 populations that can be used for Ne estimations and calculations of pairwise divergence times.
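A minimal sketch of the conversion described above, under stated assumptions: linearized FST is taken as FST / (1 - FST) (the usual Slatkin-style linearization, which the text does not spell out), the per-generation Ne trajectory is a plain list ordered from the most recent generation backwards, and all function names and input values are hypothetical.

```python
import math
from statistics import harmonic_mean

GENERATION_YEARS = 29   # generation time stated in the text
NE_MAX = 10_000         # populations with a larger Ne are excluded in the text

def recent_ne(ne_by_generation, n_generations=50):
    """Harmonic mean of per-generation Ne over the most recent generations."""
    return harmonic_mean(ne_by_generation[:n_generations])

def linearized_fst(fst):
    """Assumed linearization: FST / (1 - FST)."""
    return fst / (1.0 - fst)

def divergence_time_years(fst, ancestral_ne):
    """Time = 2 * Ne * linearized FST (in generations), converted to years."""
    generations = 2.0 * ancestral_ne * linearized_fst(fst)
    return generations * GENERATION_YEARS

# Toy trajectory: a stable Ne of 3,000 over the last 50 generations.
ne = recent_ne([3000.0] * 50)
if ne <= NE_MAX:
    print(round(divergence_time_years(0.02, ne)))  # approximate split time in years
```

The harmonic mean is dominated by the smallest values in the trajectory, so a brief recent expansion has little effect on the estimate while a bottleneck lowers it markedly.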
To calculate the ancestral population size Ne, we average the Ne of the two populations. A second calculation uses the harmonic mean of the two Ne, which is never larger than the arithmetic mean and is more affected by small values. This second formula aims at balancing the result in case one of the two populations has a very large Ne. The divergence times calculated with the two ancestral Ne reconstructions are compared in Figure S1B. Pairwise distances (genetic, geographic, and divergence times) are annotated in Dataset S2.

Fig. S1. A. Comparison between the two formulas considered to calculate divergence time from linearized FST and effective population size Ne, from Nei (18) and Bhatia et al. (19). B. Comparison between the two ways of calculating the ancestral effective population size Ne between two populations: with the arithmetic mean and with the harmonic mean of the Ne of the two populations in the pair.
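The two ways of averaging Ne can be compared on a toy pair of populations. The sketch below is illustrative only; the function name and the Ne values are hypothetical.

```python
import math
from statistics import harmonic_mean, mean

def ancestral_ne(ne_a, ne_b, method="arithmetic"):
    """Approximate the ancestral Ne of a pair from the two present-time Ne values."""
    if method == "arithmetic":
        return mean([ne_a, ne_b])
    if method == "harmonic":
        return harmonic_mean([ne_a, ne_b])
    raise ValueError(f"unknown method: {method}")

# One small and one large population (hypothetical values).
print(ancestral_ne(1000.0, 9000.0, "arithmetic"))  # 5000.0
print(ancestral_ne(1000.0, 9000.0, "harmonic"))    # close to 1800
```

Because divergence time scales linearly with the ancestral Ne, the harmonic variant here would yield a split time less than half as old as the arithmetic variant for the same FST.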
1a. Overview of GeLaTo dataset: coverage, language family distribution, genetic relatedness, spatial autocorrelation, time divergences
The structure of the GeLaTo database is inspected and described by looking at networks of close relatedness, geographic coverage and representativeness of different language families, incidence of spatial autocorrelation between genetic distances, and reliability of time divergence estimates. The continental structure of human genetic and linguistic diversity, and the intrinsic characteristics of database coverage, can bias the identification of matches and mismatches.
The global network of genetic relatedness is displayed on a map in Figure S2. Small FST distances are weighted according to their percentile in the FST distribution. We calculate different FST density distributions for each continent, separating Africa, the Americas, Eurasia, Southeast Asia and Oceania. We use the FST distribution within each macro continent for distances within that continent, and the FST distribution of the whole database for distances between continents, setting a series of percentile thresholds, from 0.02% to 50%, in steps of 0.02. Each pairwise FST distance is then assigned to the corresponding percentile in the relevant FST distribution. Eighteen outlier populations (with average and regional FST distances above 0.1) are considered as "drifted" populations and excluded from FST distribution and percentile calculations, and from subsequent analyses based on FST distribution comparisons. In Figure S2, the smallest FST distances (those within the lowest 10% of the distribution) are displayed as lines connecting the two populations involved. A dense network of close genetic relatedness is visible in Europe, reaching also North Africa and the eastern Mediterranean. Close genetic connections over long distances are found in central Asia, along Eastern Asia, between Mesoamerica and Western South America, and in sub-Saharan Africa.
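The percentile assignment can be sketched as below. This is a hedged illustration: the thresholds are read here as fractions from 0.02 to 0.50 in steps of 0.02 (one possible reading of the text), the percentile rank is the empirical share of reference distances not larger than the value, and all names and toy data are assumptions.

```python
from bisect import bisect_right

# Assumed reading of the threshold series: 0.02, 0.04, ..., 0.50
THRESHOLDS = [round(0.02 * k, 2) for k in range(1, 26)]

def reference_distribution(continent_a, continent_b, within, overall):
    """Within-continent distribution for same-continent pairs, global otherwise."""
    return within[continent_a] if continent_a == continent_b else overall

def percentile_rank(value, sorted_reference):
    """Share of reference FST values that are <= value."""
    return bisect_right(sorted_reference, value) / len(sorted_reference)

def assign_threshold(value, sorted_reference, thresholds=THRESHOLDS):
    """Smallest threshold whose percentile covers `value`, or None above 50%."""
    rank = percentile_rank(value, sorted_reference)
    for t in thresholds:
        if rank <= t:
            return t
    return None

# Toy reference distribution: 100 sorted FST values from 0.001 to 0.100.
reference = sorted(i / 1000 for i in range(1, 101))
print(assign_threshold(0.0015, reference))  # falls in the lowest threshold
print(assign_threshold(0.0605, reference))  # None: above the largest threshold
```

Drifted populations would be removed from the reference lists before sorting, so that their large distances do not shift the percentiles.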
To explore the power of the GeLaTo dataset in representing linguistic diversity, for each population we count the number of neighboring populations from a different language family within a radius of 1000 km (Fig. S3A). 85% of the populations in GeLaTo have at least one linguistically unrelated neighbor population within this radius. The median number of linguistically unrelated neighbors is five. The highest numbers of linguistically unrelated neighboring populations are found in the Caucasus, the Ural Mountains, Oceania, South Africa and the Mediterranean. The availability of linguistically unrelated neighbors is then explored for varying geographic radii, from 500 to 3000 km. As expected, the number of linguistically unrelated neighbors increases linearly with larger radii (Fig. S3 panel B), with more than 98% of the populations having at least one linguistically unrelated neighbor for distances > 2000 km. To represent the linguistic diversity of GeLaTo, we also count the number of different language families within a radius of 1000 km (Fig. S3 panel C).
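The neighbor counts described above can be illustrated with a short sketch. It assumes precomputed distances from the target population, counts only families other than the target's own as "different", and uses hypothetical names and data throughout.

```python
def unrelated_neighbours(target, family, dist_to_target, radius_km=1000.0):
    """Return (number of unrelated neighbouring populations,
    number of distinct other language families) within `radius_km`."""
    own = family[target]
    hits = [
        pop for pop, km in dist_to_target.items()
        if km <= radius_km and family[pop] != own
    ]
    return len(hits), len({family[pop] for pop in hits})

# Hypothetical populations, families, and distances (km) from target "A".
family = {"A": "F1", "B": "F1", "C": "F2", "D": "F3", "E": "F2"}
dist_from_a = {"B": 200.0, "C": 400.0, "D": 1500.0, "E": 900.0}

for radius in (500.0, 1000.0, 2000.0, 3000.0):
    print(radius, unrelated_neighbours("A", family, dist_from_a, radius))
```

Running the function over a range of radii, as in the loop, mirrors the 500 to 3000 km exploration described in the text.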
For the analyses associated with Figure 2, we use the Isolation By Distance (IBD) model, which predicts a correlation between genetic and geographic distances (22). This model describes a gradient of genetic distances and is opposed to a scenario of strong genetic structure between populations. Attempts to estimate the strengths of IBD usually rely on the Mantel Test, which can be biased by the effects of hierarchical population structure (23). We expect our human genetic dataset to be affected by a combination of IBD and regional substructure connected by gene flow. We explore spatial autocorrelation effects with distance-based Moran Eigenvector Maps (dbMEM) (24). The RsqAdj is 0.89, suggesting that a large proportion of the genetic variation can be attributed to spatial patterns. The first four components are shown in Figure S4. The first vector highlights a strong spatial autocorrelation in South America and Asia. The second reveals a positive correlation again in the Americas, moderate in Northern Asia, and high in Europe and North Africa. The third vector of genetic and geographic correlation finds high values in Europe, North Africa and the Middle East. The fourth vector shows spatial autocorrelation in the Americas, the Middle East, Africa and Oceania. These components of geographic and genetic correlation are strong within separate continental regions and are the result of large-scale human migrations.
FST distances have been roughly converted into pairwise split divergence times, by accounting for effective population size. Excessively high population sizes can be explained by admixture and drift: populations associated with such profiles are filtered out, as well as populations that do not display a stable population size. To mitigate the effects of high population sizes in the comparisons, we also include comparisons based on the harmonic mean of the two Ne (Fig. S1B). These divergence times do not account for migration exchanges after population split and should be taken as an indicative temporal frame for the separation between populations. Furthermore, because our method is based on IBD blocks shared from common ancestors, the ability to reconstruct population size variation becomes less reliable after 50 generations (20): this might affect the accuracy of the oldest divergence times reconstructions. The divergence time distribution for each continent is shown in Figure S5. The congruences and limitations of our divergence time reconstructions have been examined by comparing continental profiles against available knowledge on the genetic history of the continents. Eurasia divergence times are pushed as far back as 60 thousand years ago (kya), in line with the history of colonization and dispersal after the Out of Africa event (25). The split times in the Americas are below 17 kya, also in line with accredited reconstructions of the major peopling of the continent (26). The peopling of the Oceanic region started at ~5,000 years ago, but a more ancient "Papuan" ancestry admixed in current populations (27), creating larger genetic differences over the region which we see in our dataset (such as split times up to 10 kya). In Africa, our estimates are mostly below 30 kya: this date is too recent to account for the ancient structure of the continent. 
Genetic studies reconstructed population splits of ~100 kya between Eastern and Western Africa, and older than 200 kya for the San hunter-gatherers (Tuu and K'xa speakers) and the rest of the continent (28). The effect of migration and contact occurring over the African continent must have affected our power to reconstruct accurate deep divergence times for this continent; as stated before, deep time split events cannot be estimated reliably with the method used. With the harmonic mean formula, divergence times are noticeably more recent for Oceania, Southeast Asia and the Americas (Fig. S5B). Within these continents, Southeast Asia and the Americas display divergence times too recent to be compatible with their colonization history. The incidence of isolated populations with extremely small population sizes in these continents might be excessively magnified by the harmonic mean.

Mismatches between genetic and linguistic relatedness
2a. Global overview of close genetic distances
The GeLaTo dataset is screened for pairs that are genetically close but linguistically distant, by looking at small FST distances between speakers of unrelated languages (case exemplified in Fig.  S6A). Each FST distance is associated with a percentile range value in the overall continental/global FST distribution (see section above). As FST distances are differently distributed in broad macroregions, due to different colonization processes and the Out of Africa effect, we used macro continents as distinct units of analysis for adjusting the summary statistics in global comparisons. These were shown as separate blocks of spatial autocorrelation in the main eigenvectors of the dbMEM analysis (Fig. S4). We count how many pairs have an FST below each percentile threshold. For these pairs, we count the proportion of pairs from different language families. Finally, for each population, we annotate the smallest FST percentile threshold value to a population from a different language family (Dataset S1) and for each pair, we annotate the correspondent percentile (Dataset S2).
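The counting step can be sketched as follows, assuming each pair already carries its percentile rank and a flag for shared family membership. The data structure and names are hypothetical.

```python
def unrelated_share_by_threshold(pairs, thresholds):
    """For each threshold, return (number of pairs at or below it,
    proportion of those pairs that come from different language families).

    `pairs` is an iterable of (percentile_rank, same_family) tuples.
    """
    pairs = list(pairs)
    result = {}
    for t in thresholds:
        below = [same for rank, same in pairs if rank <= t]
        share = sum(not same for same in below) / len(below) if below else None
        result[t] = (len(below), share)
    return result

# Toy pairs: (percentile rank of the pairwise FST, same language family?)
pairs = [(0.01, True), (0.02, False), (0.05, True), (0.10, False), (0.10, False)]
print(unrelated_share_by_threshold(pairs, [0.02, 0.10]))
```

Because the thresholds are cumulative, the share of unrelated pairs is expected to grow as the threshold increases.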
This analysis gives a global overview of the genetic relatedness across language families (Fig. S6B). The proportion of genetically close and linguistically unrelated pairs is expected to increase with larger FST threshold percentiles. At the lowest FST threshold considered (0.02% of the distribution), 4% of the pairs speak languages from different families (Fig. S6C). This potentially corresponds to a mismatch in gene-language vertical transmission. For FST larger than the 0.14 percentile of the distribution, more than half of the pairs are composed of populations from different families (Fig. S6C). Fig. S6D shows how the genetically close and linguistically unrelated pairs are distributed between 15 major language families, each represented by more than 5 genetic populations. The major axes of mismatches are between language families in Eurasia: a western network includes Indo-European, Turkic, Abkhaz-Adyge and Uralic speaking populations, and a more eastern one includes Tungusic, Sino-Tibetan and Mongolic-Khitan speaking populations. These networks can be shaped by cases of population contact and episodes of language shift, which are discussed further in the following sections.

Two heuristic descriptors are used to flag populations that show a mismatch between genetic and linguistic relatedness. The first one is a conservative analysis in which each population is tested for having a close genetic relatedness with speakers of the same language family, regardless of geographic distance. We select only populations who belong to a language family represented in GeLaTo by more than 2 populations, and for each of them we annotate 1) the closest FST to speakers of the same language family (or same language, for the case of language isolates) and the corresponding geographic distance and 2) the closest FST to speakers of a different language family and the corresponding geographic distance. These values are reported in Dataset S1.
If 2) is smaller than 1), and the population of 2) is geographically more distant than 1), the target population is flagged as presenting a "mismatch" with their linguistic and geographic neighbors and called an enclave. If the situation is inverted, the target population is genetically close to a geographically distant linguistic relative, and is therefore a "match" with other speakers of the same family despite the geographic distance. This is a conservative way to spot genetic migrants, who might have changed their original language to the language of their neighbors but maintained genealogical ties with the original group, which is now linguistically distant. Conversely, this test can be used to identify cases of matches that persist beyond geographic distance.
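The enclave heuristic can be written as a small decision rule. This sketch reflects one reading of the text: the "inverted" situation is taken to mean that the same-family population is both genetically closer and geographically farther. The type and function names are hypothetical.

```python
from typing import NamedTuple, Optional

class Nearest(NamedTuple):
    fst: float  # smallest FST to a population of the given kind
    km: float   # geographic distance to that same population

def classify(same_family: Nearest, other_family: Nearest) -> Optional[str]:
    """Flag a target population from its genetically closest same-family
    population (1) and its genetically closest different-family population (2)."""
    if other_family.fst < same_family.fst and other_family.km > same_family.km:
        return "enclave"  # mismatch: closer to a distant, unrelated-language group
    if same_family.fst < other_family.fst and same_family.km > other_family.km:
        return "match"    # closer to a distant linguistic relative
    return None           # neither criterion applies

print(classify(Nearest(0.020, 100.0), Nearest(0.010, 800.0)))  # enclave
print(classify(Nearest(0.010, 900.0), Nearest(0.030, 150.0)))  # match
print(classify(Nearest(0.010, 100.0), Nearest(0.030, 800.0)))  # None
```

Requiring both the genetic and the geographic condition is what makes the rule conservative: a population whose closest genetic and geographic neighbor coincide is left unflagged.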
To spot the opposite case of mismatch, the linguistic migrants, we must search for populations that have very close genetic distances to neighbors who speak an unrelated language. As seen in Figure S6, the distribution of FST varies across continents, and it is not possible to establish a single threshold of relatedness that is equally meaningful for the different regions and population histories. A conservative approach is to search for populations that have an FST = 0 to populations from a different language family. An FST of zero would correspond to sharing the same gene pool, in complete panmixia (the variance between populations is equal to the variance between the individuals of each population). If the FST to other members of the same language family is higher, the mismatch in their linguistic affiliation is confirmed.
The 27 cases of genetic enclaves reported are listed in Table S1. For each, the genetically closest population from a different language family is also annotated. Only one case of linguistic mismatch/linguistic enclave can be identified: the Hungarians, here represented by two population samples, which are genetically indistinguishable from their Indo-European neighbors (29,30). Fifty-two populations of the dataset are classified as matches above geographic distance. Twenty populations do not have a neighbor from the same language family. The population enclaves in Table S1 are ordered by geographic distance to the closest population from a different language family. The list includes the population identified as "Jew Georgian", Jewish immigrants who adopted a language from the Caucasus; Khomani, a group living in a region of South Africa where Khoe groups were dominant, but speaking a language of the Tuu family; the two populations speaking Yukagir, genetically closer to Turkic and Tungusic speakers of Siberia than to each other; Wayku, lowland Quechua speakers genetically closer to another distant Amazonian population (Cocama) than to the neighboring Andean Quechua speakers; and one Basque population from Spain, genetically closer to other Spanish speaking groups than to the neighboring Basque speaking populations of the dataset; this corresponds to a particular case applied to a linguistic isolate.

Table S1. List of populations flagged as genetic enclaves and linguistic enclaves (the latter corresponding to two Hungarian genetic populations, in italic). The last three columns show features associated with the second heuristic criterion for mismatch, the misalignment of FST distributions. NA indicates that the number of available comparisons is too small to calculate a distribution of FST within language family and between language families. If both the median and the lowest CI of the difference in between-within family comparisons are positive (highlighted in boldface), the population is in alignment with their linguistic relatives, in contrast to the mismatch status implied by the genetic enclave assignment.
However, many of these single mismatch cases are non-informative. In the Americas, some populations from the Tupian or Uto-Aztecan families are genetically close to Maya, but the Maya group shares genetic similarities all over the continent, probably due to their less drifted genetic profile and/or to a proposed gene flow from Mesoamerica to the south (31). Nama speakers are also on the list, because of their genetic proximity with Khomani, but the latter is instead the one historically matching the scenario of a language shift. Nama speakers have been described as genetically similar to southern Tuu speakers, living in the regions from which the Nama originally came (32,33). Han speakers are genetically similar to Koreans, but it is rather the latter who are driving this connection because of their genetic similarity to continental Asia; furthermore, the Koreanic language family, which is a very small family represented by two languages, is not represented by any other genetic population and cannot be flagged as mismatching by this method. Finally, Bengali are genetically close to a Dravidian group but are geographically isolated, and their closest neighbor is at more than 700 km of distance, thus making the comparison less informative. These examples show the potential and the limitations of this strict search for mismatching populations, which is heavily influenced by the structure of the dataset. First, the range of geographic neighbors: if the closest geographic neighbors are too distant, the matching is not informative (cases in the bottom rows of Table S1). Second, the population from a different language family which is genetically close but geographically distant can be the "exceptional" one driving this mismatch: either because it has very close FST with many populations, even at large geographic distances (this is the case of Maya, driving most of the possible mismatches found in South America), or because it is the one that possibly experienced the language shift.

2c. Single population mismatches: misaligned FST distributions
For our second screening of population matches and mismatches, we compare FST distributions within and between language families. This overview accounts for a degree of overlap in the two distributions and accepts a more flexible and realistic scenario. Each target population was tested for their distribution of FST distances with a) speakers of the same target language family and b) speakers of different language families. In principle, most of the populations of the dataset are expected to show a smaller FST with the speakers of their language family than with the linguistically unrelated populations across the continents. To restrict the test to a similar baseline of potential genetic relatedness, a maximum geographic radius was applied to this comparison. This threshold corresponds to the maximum geographic distance between the target population and the other populations of the same language family. A minimum radius of 500 km is applied for language families with a small geographic extension. Of the 316 populations for which the two medians can be calculated, 64 (20%) have a negative difference between the median between-family FST and the median within-family FST.
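The alignment statistic (median between-family FST minus median within-family FST, inside a family-specific radius) can be sketched as follows. The data layout, names, and toy values are assumptions for illustration; the confidence intervals reported in Dataset S1 are not reproduced here.

```python
import math
from statistics import median

def alignment_score(target, family, dist_km, fst, min_radius_km=500.0):
    """Median between-family FST minus median within-family FST for `target`.

    The radius is the largest distance to a same-family population, but never
    below `min_radius_km`. Negative scores indicate a misaligned population.
    Returns None when either distribution is empty.
    """
    own = family[target]
    others = [p for p in family if p != target]
    same = [p for p in others if family[p] == own]
    if not same:
        return None
    radius = max(min_radius_km, max(dist_km[frozenset((target, p))] for p in same))
    within = [fst[frozenset((target, p))] for p in same]
    between = [
        fst[frozenset((target, p))]
        for p in others
        if family[p] != own and dist_km[frozenset((target, p))] <= radius
    ]
    if not between:
        return None
    return median(between) - median(within)

# Hypothetical target "T" with two relatives and three unrelated populations.
family = {"T": "F1", "S1": "F1", "S2": "F1", "U1": "F2", "U2": "F2", "U3": "F3"}
dist_km = {frozenset(("T", p)): d for p, d in
           {"S1": 300.0, "S2": 800.0, "U1": 400.0, "U2": 700.0, "U3": 2000.0}.items()}
fst = {frozenset(("T", p)): v for p, v in
       {"S1": 0.01, "S2": 0.02, "U1": 0.05, "U2": 0.03, "U3": 0.20}.items()}

print(alignment_score("T", family, dist_km, fst))  # positive: aligned with relatives
```

In this toy case the radius is 800 km, so the distant population U3 does not enter the between-family distribution.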
The highest values of alignment (higher between-language family FST) are found in Africa for the Atlantic-Congo speakers, which have high FST distances to most hunter-gatherer groups in southern Africa speaking languages from the Tuu and Kx'a families (Fig. S7-8). A high number of misaligned populations is present in the Caucasus, and a smaller number is found in Europe. After this screening, some of the genetic enclaves discussed above are not immediately confirmed as mismatches, as they have median FST distances within language families smaller than those between language families. In addition, 13 populations also have a positive lower Confidence Interval associated with the difference between and within language family FST (the last columns in Dataset S1). These 13 populations were previously flagged as genetic enclaves but, when placed in context, appear overall genetically close to their linguistically related populations. These 13 populations are Jewish from Georgia, Han, Yoruba, Mongola, Bengali, Nama, Guarani, Bengali, Khomani, Avar Gunibsky, Evenk Far East and one of the two Hungarian populations. Four populations previously classified as Matches qualify as misaligned under the FST distribution criteria: these are Madak, Santa Cruz, Atayal and Karachai. In contrast, 65 cases of mismatches with misaligned FST distributions are found, of which six were previously flagged as enclaves (Hazara, Cocama, G|ui, Azeri Azerbaijan, Mengen and one of the two Hungarian populations). Relevant cases of misaligned populations are those also associated with a small confidence interval. Populations with a negative difference in median FST and an associated CI < 0.01 are Maltese, Chechen, Ingushian, Tatar_Mishar, Hazara and Cochin Jews. In Africa, Naro and ǂHoan are also relevant outliers, as the only populations with a negative difference of medians for the Khoe and Kx'a language families, respectively (Fig. S7B, S8).
These two populations have previously been described as undergoing language shifts with evidence of substantial linguistic contact and sociocultural power imbalance (2,34).
To address the overrepresentation of some language families, we perform a downsampling sensitivity test. We randomly select a maximum of 8 populations per language family, and then calculate the proportion of populations that are 1) close to a linguistically unrelated population, 2) matching enclaves, 3) genetic/linguistic enclaves, and 4) aligned/misaligned populations.
We repeat this procedure 100 times. The subsampled datasets of 185 populations return a proportion of matches and misaligned populations compatible with that found in the whole dataset. However, the populations close to linguistically unrelated populations and the mismatches become more numerous across the random downsampling iterations, possibly because they are more affected by less dense population coverage (Fig. S11).
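The downsampling procedure can be sketched as below. The cap of 8 populations per family and the 100 replicates come from the text; the function names, the seed handling, and the placeholder statistic are assumptions.

```python
import random
from collections import defaultdict

def downsample(family, max_per_family=8, rng=random):
    """Keep at most `max_per_family` randomly chosen populations per family."""
    by_family = defaultdict(list)
    for pop, fam in family.items():
        by_family[fam].append(pop)
    kept = []
    for pops in by_family.values():
        if len(pops) > max_per_family:
            pops = rng.sample(pops, max_per_family)
        kept.extend(pops)
    return kept

def replicate(family, statistic, n_rep=100, seed=0):
    """Apply `statistic` to `n_rep` independent downsampled population lists."""
    rng = random.Random(seed)
    return [statistic(downsample(family, rng=rng)) for _ in range(n_rep)]

# Toy dataset: one overrepresented family (20 populations) and a small one (3).
family = {f"big{i}": "F1" for i in range(20)}
family.update({f"small{i}": "F2" for i in range(3)})

sizes = replicate(family, len)  # placeholder statistic: subsample size
print(set(sizes))               # every replicate keeps 8 + 3 populations
```

In practice `statistic` would be any of the four proportions listed above, recomputed on each subsample and then compared with the value from the full dataset.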

Genetic and linguistic similarities in the historical timeline
For this analysis we compared the time frame of the genetic divergence against the proposed time frame of language divergence. We only considered pairs of populations that share a most recent common ancestor near the root of the language family. The timing of the genetic divergence distribution can in principle be compared to the proposed divergence time of the proto-language for each language family. It should be noted that not all the pairwise FST distances can be converted into divergence times, because of the limitations in reconstructing effective population size from Identity By Descent blocks explained in the methods section above; however, many can be compared to the linguistic time estimates.
These comparisons are visualized in three figures. First, Figure 3 shows the distribution of genetic divergence times for major language families, excluding drifted populations and marking populations flagged as enclaves and/or misaligned with different symbols. Second, Figure S21A shows these comparisons based on the harmonic mean of the two Ne, which is associated overall with smaller divergence time estimates, to compare the results obtained with a slightly different formula (see Section 1 of the Supplementary). Third, Figure S21B shows the distribution of genetic divergence times for each language family represented by more than one pairwise divergence time, including drifted populations, which can potentially drive deeper divergence times with their large FST distances.
In Africa, the speakers of Afro-Asiatic languages show a median divergence time of ~3,200 years ago, with some pairs diverging as long ago as 5,000 and 7,000 years. This time frame is more recent than the one suggested by linguistic and archaeological reconstructions, which point towards a very ancient divergence time in the pre-Neolithic (35). The genetic divergence time is more similar to the time range reconstructed with the Generalized Bayesian Dating (GBD) method (36). The Atlantic-Congo language family, here represented by the genetically cohesive Bantu speaking groups, has the bulk of pairwise time divergences compatible with the demographic diffusion associated with a shift to agricultural subsistence starting ~4,000 years ago (37). The reconstructed times are compatible with the harmonic Ne estimates, but the regular mean Ne estimates include pairs that diverged further back in time. For the hunter-gatherer Kx'a, genetic divergence times are found around 2,500 years ago. A linguistic time depth for the family is difficult to reconstruct from historical sources, but the genetic divergence times available are compatible with the GBD results (36). For the neighboring Khoe-Kwadi, a possible origin and migration with pastoralist groups is postulated. Archaeological data indicate that this migration is at least 2,000 years old (38), which corresponds to some of the genetic divergence dates reconstructed. Younger divergence cases fall within the timing suggested by the GBD method.
In the Americas, Tupí populations are associated with very ancient divergence times around 7 kya, but the heterogeneous genetic composition and presence of the highly drifted Karitiana and Suruí, together with a relatively small sample size, suggest caution in interpreting the result. The divergence times implied by the high FST distances of these populations would be too ancient to be reconciled with the origin and spread of the family. A possible divergence time of ~3-5 kya has been proposed for the Tupí language family expansion, based on glottochronological data (36,39) and on archaeological evidence associated with the agricultural transformation of the landscape and a putative Tupí pottery style (40-42). The Quechua family also presents cases of language shift, showing a genetically cohesive core in the central-southern Andes, where the family might have originated (43) and where we see matches according to the stringent enclave criteria. This genetically cohesive core contrasts with the presence of Quechua lowland speakers on the eastern slope of the Andes who have a distinct Amazonian ancestry (10). The overall divergence frame is too ancient to fit the time range from the GBD method and to be reconciled with the historical paths that link the diffusion of the Quechua family to the early expansion of the Wari empire starting ~1,400 years ago (44,45).
In Eurasia, Uralic speakers do not show signals of genetic cohesiveness, as seen in the previous analysis. One of the oldest divergence times for this language family (as reconstructed with quantitative or historical linguistic methods) is 6-7 kya (46,47), while other authors suggest more recent dates of 4-5 kya (48). While a few comparisons could be consistent with the oldest date proposed, the tail of older population divergences in the genetic data confirms that most comparisons include populations with a very divergent genetic history. This finding suggests that genetically unrelated groups who diverged more than 7 kya adopted Uralic languages as a result of cultural exposure without substantial demographic influences. Nevertheless, some pairwise comparisons show much more recent divergence times, especially with the harmonic Ne estimates. Recent studies have confirmed this strong geographic substructure for Uralic speakers, but also described a potential, more recent and subtle demographic exchange between long-distance Uralic speakers (29), which is not detectable with our FST analysis. For Nakh-Daghestanian (here represented only by pairs within the Daghestanian subfamily), a very ancient root has been proposed, up to 8 kya (49): this language family presents one rare situation where the genetic divergence times are more recent than the ones reconstructed for language divergence, and also more recent than the time frame reconstructed with the GBD method.
In eastern Asia, a relatively shallow historical time around 2 kya has been proposed for the Tungusic family (50,51). Of the two divergence times reconstructed (excluding one flagged with a misaligned population), one is compatible with this archaeological and historical estimate, and the other, at ca. 3 kya, is at the extreme of the range reconstructed with the GBD method. The Turkic family is associated with a similarly shallow origin around 2,500-2,000 years ago, based on contact linguistics (50,52), and supported by Bayesian approaches calibrated with the Seljuk conquest of Baghdad (1055 CE) as the latest date for the divergence of Seljuk-derived languages (Turkish, Azeri, Gagauz) from other Oghuz languages (Turkmen) (53). Only a small number of the available divergence times fit this frame, or the slightly older one reconstructed with the GBD method, while most comparisons (including populations already flagged as misaligned) are much older, and a few are even younger than these dates (especially with the harmonic Ne estimates). For Indo-European, we consider the classic dichotomy between the old chronology / Anatolian hypothesis at ~8,000 years ago (54) and the recent chronology / Kurgan steppe hypothesis at 5,500-6,500 years ago (55). The genetic divergence times seem quite old overall, not fitting the recent chronology but exceeding the limits of the old chronology as well, although the harmonic Ne estimates could also fall within the recent chronology time frame.
Finally, looking at Southeast Asia and the Pacific, very old dates are reconstructed for the Austronesian family, which includes only populations already flagged as misaligned. For this language family, divergence time has been associated with a population expansion from Taiwan towards the Pacific, starting ~5,500 years ago (56,57).

Linguistic time divergence distances for single language families
We focus on three language families to explore the gene-language correlation with one measure of linguistic distance: divergence time. Both genetic divergence times and FST distances are compared against linguistic divergence times for target language families from published sources. Linguistic time splits are extrapolated from trees built with Bayesian statistical methods from lexical datasets based on cognate sets. These reconstructions can be applied only within an established language family, and not across distinct families. External calibration points are used by the authors to anchor the language tree to a time scale, often working with relaxed clock models to allow the diversification rate to vary across branches. Six linguistic publications have been considered, with the following number of languages matching one or more populations from our genetic database: 32 for Indo-European (data from (58)), 26 for Austronesian (56), 19 for Turkic (53), plus a second Indo-European dataset with 16 matches (59) and a second Turkic dataset with 20 matches (60). These last two datasets provided either too recent linguistic splits or poor coverage when compared to our available genetic divergence times: the results are shown in Figure S22. The reduced number of matches is again due to the fact that not all the genetic populations were usable to calculate Ne and thus the divergence time.
Tree topologies are reconstructed from FST distances, considering genetic populations as taxa. The corresponding linguistic tree is reconstructed for the same set of populations included in GeLaTo for which there is overlap. For different genetic populations who speak the same language, a linguistic distance of zero is applied. The cognacy-based trees from the original linguistic publications are compared against Neighbor-Joining trees built from the matrix of FST distances and plotted in Figure 4 (panels A-C). For visualization purposes, the two trees are displayed facing each other with optimal branch order. The result replicates the classic Cavalli-Sforza display (61) and can be inspected to highlight correspondences and differences.
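As an illustration of the zero-distance rule for populations sharing a language, the following minimal Python sketch expands a language-level distance table to population pairs. The function name, the data layout and the toy values are assumptions made for this example and are not taken from the GeLaTo pipeline.

```python
from itertools import combinations


def population_linguistic_distances(pop_to_lang, lang_dist):
    """Expand language-level distances to population pairs.

    pop_to_lang: dict mapping population name -> language name.
    lang_dist: dict mapping frozenset({lang_a, lang_b}) -> distance.
    Populations that speak the same language get a distance of zero.
    """
    out = {}
    for p, q in combinations(sorted(pop_to_lang), 2):
        lang_p, lang_q = pop_to_lang[p], pop_to_lang[q]
        if lang_p == lang_q:
            out[(p, q)] = 0.0
        else:
            out[(p, q)] = lang_dist[frozenset((lang_p, lang_q))]
    return out


# Hypothetical toy input; names and values are illustrative only.
pop_to_lang = {"Pop_A1": "LangA", "Pop_A2": "LangA", "Pop_B": "LangB"}
lang_dist = {frozenset(("LangA", "LangB")): 2.5}
pop_dist = population_linguistic_distances(pop_to_lang, lang_dist)
```

In this toy case the two populations speaking LangA are separated by zero linguistic distance, and each of them inherits the LangA-LangB distance with respect to Pop_B.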
Quartet analysis is performed to estimate overall similarities between the two tree topologies (Fig. S23). The proportion of identical quartets is 0.68 in the Indo-European trees, 0.65 in the Austronesian trees and 0.57 in the Turkic trees. The value ranges from 1 for a perfect match between the two trees, to zero when all branches are different.
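The proportion of identical quartets can be illustrated with a minimal Python sketch. It assumes trees written as nested tuples over the same taxa, and it counts a quartet as identical only when both trees resolve it into the same pair-versus-pair split (quartets left unresolved by either tree are counted as non-matching). The function names and the toy trees are hypothetical; the software used for Fig. S23 may treat unresolved quartets differently.

```python
from itertools import combinations


def clades(tree):
    """Return (all leaves, leaf sets below each internal node)."""
    found = []

    def walk(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.append(leaves)
        return leaves

    return walk(tree), found


def quartet_split(clade_list, quartet):
    """Return the pair|pair split induced on four taxa, or None if unresolved."""
    q = frozenset(quartet)
    for clade in clade_list:
        side = clade & q
        if len(side) == 2:
            # An edge separates two of the four taxa from the other two.
            return frozenset([side, q - side])
    return None


def quartet_similarity(tree_a, tree_b):
    """Proportion of four-taxon subsets resolved identically in both trees."""
    taxa_a, clades_a = clades(tree_a)
    taxa_b, clades_b = clades(tree_b)
    if taxa_a != taxa_b:
        raise ValueError("both trees must contain the same taxa")
    same = total = 0
    for quartet in combinations(sorted(taxa_a), 4):
        total += 1
        split_a = quartet_split(clades_a, quartet)
        if split_a is not None and split_a == quartet_split(clades_b, quartet):
            same += 1
    return same / total


# Hypothetical five-taxon trees that differ in the placement of B and C.
tree_a = ((("A", "B"), "C"), ("D", "E"))
tree_b = ((("A", "C"), "B"), ("D", "E"))
print(quartet_similarity(tree_a, tree_a))  # 1.0
print(quartet_similarity(tree_a, tree_b))  # 0.6
```

Of the five quartets in the toy example, the two that contain A, B and C together are resolved differently by the two trees, which gives 3/5 = 0.6.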
The Indo-European trees are the ones with the largest number of correspondences. Focusing on the mismatches, the early diverging linguistic position of Greek and Albanian groups is not mirrored in the genetic tree; however, it should be noted that they branch off at a very early position in comparison to the other West Eurasian groups (together with the Sicilians
The correspondence between linguistic and genetic diversity within subbranches of each language family is also examined with a language-based approach (Fig. S24). Here the linguistic divergence time is considered against the genetic divergence time. Each taxon is a language (not a population, as in the previous set of tree comparisons), and divergent values for speakers of the same language are condensed for each node: the maximum and mean values of all the divergence times are taken for multiple populations speaking the same language and for the upstream genetic coalescences. Finally, for each node, the ratio between the mean linguistic and genetic split times is reported. Turkic is the language family for which linguistic and genetic comparisons show the fewest correspondences. In Indo-European, the main exception is the branch with Baluchi, Kurdish, Persian and Tadzik, for which the reconstructed genetic divergence times are about half as old as the linguistic divergence times. In Austronesian, maximum genetic divergence times are particularly ancient, due to possible admixture with the pre-Austronesian genetic substrate discussed in the main text. Mean genetic divergence times for splits of major subgroups of the language phylogeny are roughly concordant with the linguistic divergence times. Similar parallels are found with the divergence times within the Polynesian linguistic branch and with the split of Bajo. Within the other Austronesian linguistic branches represented in GeLaTo, genetic divergence times are up to three times older than the linguistic divergence times.
Fig. S24. Language-based correspondence of genetic divergence times over linguistic phylogenies. Major language family sub-branches are indicated with vertical text. We retrieved the language divergence time trees from the original publications (with the original language names) and extracted a subset corresponding to populations for which there is a genetic representative in GeLaTo. We plot the genetic divergence times available in GeLaTo (note that not all populations can be used to reconstruct the divergence time) over the linguistic phylogenetic structure. For each pair of languages, the mean of the genetic divergence times was calculated (to account for different genetic populations who speak the same language).
Dataset S1 (separate file). The table includes information on the 397 genetic populations included in the analyses: metadata on language association, geographic location, reference source of the data, sample size, and parameters of genetic relatedness calculated for the analyses.
Dataset S2 (separate file). The table includes information on the 157,212 pairwise comparisons for the populations included in the analyses: FST genetic distances, geographic distances, genetic divergence times, and divergence times from linguistic publications.
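The condensing of population-level divergence times to language pairs described for Fig. S24 could be sketched on a pairwise table of this kind as follows. The record layout, the function name and the toy values are hypothetical and do not reflect the actual column names of Dataset S2; skipping pairs that share a language, or that lack a divergence time, is an assumption of this sketch.

```python
from collections import defaultdict
from statistics import mean


def condense_by_language(pairs, pop_to_lang):
    """Summarise genetic divergence times per pair of languages.

    pairs: iterable of (pop_a, pop_b, divergence_time_in_years or None).
    pop_to_lang: dict mapping population name -> language name.
    Returns {frozenset({lang_a, lang_b}): {"mean": ..., "max": ...}}.
    """
    grouped = defaultdict(list)
    for pop_a, pop_b, time in pairs:
        lang_a, lang_b = pop_to_lang[pop_a], pop_to_lang[pop_b]
        # Assumption: same-language pairs and missing times are skipped.
        if lang_a == lang_b or time is None:
            continue
        grouped[frozenset((lang_a, lang_b))].append(time)
    return {
        langs: {"mean": mean(times), "max": max(times)}
        for langs, times in grouped.items()
    }


# Hypothetical toy records; values are illustrative only.
pop_to_lang = {"P1": "LangA", "P2": "LangA", "P3": "LangB"}
pairs = [
    ("P1", "P3", 3000.0),
    ("P2", "P3", 5000.0),
    ("P1", "P2", 800.0),   # same language, skipped
]
summary = condense_by_language(pairs, pop_to_lang)
```

Here the two populations speaking LangA yield two divergence times against LangB, which are condensed to a mean of 4,000 and a maximum of 5,000 years for that language pair.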